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In this work we suggest a statistical mechanics approach to the classification of high-dimensional 
data according to a binary label. We propose an algorithm whose aim is twofold: First it learns 
a classifier from a relatively small number of data, second it extracts a sparse signature, i.e. a 
^vQ , lower-dimensional subspace carrying the information needed for the classification. In particular the 

second part of the task is NP-hard, therefore we propose a statistical-mechanics based message- 
passing approach. The resulting algorithm is firstly tested on artificial data to prove its validity, 
but also to elucidate possible limitations. 

As an important application, we consider the classification of gene-expression data measured 
in various types of cancer tissues. We find that, despite the currently low quantity and quality of 
available data (the number of available samples is much smaller than the number of measured genes, 
limiting thus strongly the predictive capacities), the algorithm performs slightly better than many 
state-of-the-art approaches in bioinformatics. 

o 

PACS numbers: 07.05.Kf Data analysis: algorithms and implementation, data management; 02.50.Tt Infer- 
ence methods; 98.52.Cf Classification and classification systems; 05.00.00 Statistical physics, thermodynam- 
ics, and nonlinear dynamical systems 
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I. INTRODUCTION 



Extracting information from high dimensional data has become a major challenge in biological research. The 
development of experimental high-throughput techniques allows to monitor simultaneously the behavior of genes, 
proteins and other cellular constituents on a genome-wide scale. Linking the gene-expression profiles of specific 
tissues to global phenotypic properties, as, e.g., the emergence of pathologies, is one of the most important goals 
of this kind of studies. Particular attention is paid on cancer tissues, where the ability of classifying such tissues 
according to their cancer type has immediate impact on establishing an appropriate medical treatment [H, S S 0] • 
Global gene-expression profiling gives a new perspective in this direction. 

In this work, we consider micro-array data, which measure the abundance of messenger RNA as a mark of gene 
expression. More precisely, the logarithm of the relative abundance compared to a reference pattern (e.g. normal 
tissue or average over many tissues) is recorded, such that positive values correspond to over-, negative to under- 
expression with respect to the reference values. Each micro-array contains the information about the expression levels 
£C) • of thousands of genes, while the number of micro-arrays available for a given problem usually ranges from some tens 

£. | , to few hundreds. These relatively few, but high-dimensional and, as a further complication, noisy expression patterns 

(*-•>) render the classification task computationally hard. 

| A promising strategy for solving this problem is that of systematically reducing the space of the system by isolating 
. small set of genes that are thought to be particularly relevant for the problem. Such a set is often called a gene 
signature. There are three major reasons: (i) One may hope that the selected genes have some biological relevance 
• i-h , elucidating processes related to the pathology, (ii) The reduction of the number of considered genes improves the a 
' priori unfavorable ratio between data dimension and pattern number, and reduces the risk of over-fitting data, so that 
better predictive capabilities are to be expected, (iii) Custom-designed chips monitoring only the selected genes can 
be constructed, reducing noise and cost of the experiments. This would lead to an increased quality and quantity of 
data, and thus to better possibilities of extracting biologically relevant information. 

The problem of extracting a sparse signature from high-dimensional datasets is ubiquitous in many different fields 
ranging from computer science, combinatorial chemistry, text processing and finally to bioinformatics , and different 
approach have been proposed so far (see 0, @ for a general introduction to the problem and Q for a detailed 
discussion about cancer classification based on expression patterns). The literature about feature selection has so 
far concentrated around three different strategies: (i) wrapper which utilizes learning to score signatures according 
to their predictive value, (ii) filters that fix the signature as a preprocessing step independent from the classification 
strategy used in the second step. In our work we will present a new wrapper strategy which fall into the subclass of 
embedded methods where variable selection is performed in the training process of the classifier. 

Two complementary approaches can be considered to face a classification problem: An unsupervised approach 
tries to suitably regroup/cluster samples according to some inherent similarity measure [H, [^], whereas supervised 
approaches exploit a (partial) labeling of the data (e.g. available diagnoses for part of the patients) and aim at at 
learning a rule which allows to label also previously unlabeled data (e.g. patients without given diagnosis). The main 
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problem of a supervised approach for micro-array data is the already mentioned small number of available labeled 
samples (tissues already classified) compared with the high dimensionality (the number of genes that a priori have to 
be taken into account) of each of them. On the other hand, a supervised approach allows to capture the information 
that is relevant for the sought classification, avoiding to be misled by other structures present in the data [Toj |. 

In this paper we propose a generalization of a message-passing based algorithm to supervised classification. The 
power of the algorithm lies in its statistical physics approach, that allows (i) to deal with the combinatorial nature 
of the effect of relevant genes and (ii) to characterize the statistical properties of the set of all possible classifiers, 
weighted by their performance on the training data set, and by the number of genes on which the classifier actually 
depends. The method consists in translating the problem into a constraint satisfaction problem (CSP), where each 
constraint corresponds to one classified training pattern. This CSP is solved using belief propagation [111 ], which in 
statistical physics is also known as the Bethe-Peierls approximation. The final output of the algorithm is twofold: 
First a gene signature is provided, and, second, the classifier itself is given. 

We test this algorithm on artificial data, we then move to data sets of leukemia, prostate and colon cancer, used as 
benchmarks for several other alg orithms [121 ] and we finally consider a set of breast cancer data obtained as the union 
of two different experiments (l3l 

The paper is organized as follows: In Sec. |TT] wc introduce the problem of classification using first a statistical- 
mechanics formulation, then we show the equivalence of this formulation with the Bayesian formalism in analogy with 
the work of Kabashima et al. [HI]. In Sec. IIIII we give the details of the algorithm. In Sec. IIVI we show results on 
artificial data and in Sec. [V] we show and comment the results on RNA micro-array data. Finally, in Sec. I VII we 
present conclusions and perspectives of our approach. 



II. THE CLASSIFICATION PROBLEM 



Given are M patterns (micro-arrays) measuring the simultaneous activity irf of N genes, with i = 1,...,N and 
fx = 1, M. The value of x% gives the logarithm of the ratio between the activity of gene i in pattern fj, and some 
reference activity of gene i, so positive xf correspond to over-expression, negative to under-expression of gene i in 
pattern fj,. In addition we have an M -dimensional output vector {y^} E {±1} M that assigns a binary label to each 
pattern (e.g. cancer vs. normal tissue, or cancer type-1 vs. cancer type-2). The full data set is thus defined as a 
set of input-output pairs D — {{x 1 , y 1 ), . . . , (x AI ,y M )}. 

Here we concentrate fully on binary classifications, but this restriction can be easily overcome. A straightforward 
generalization to deal with many classes is the so-called one against all classification, where one single label is classified 
separately against a set unifying all other labels. This reformulation allows to treat a q-label problem via q—1 binary 
problems [161 ]. 

As explained before, we aim at extracting a sparse signature out of the set of all N genes, and at establishing 
a functional relation between the selected genes and the labels y M . This task seems infeasible in its full generality, 
therefore we restrict ourselves to the simpler task: we will provide a ternary variable Ji E {0, ±1} for each gene, which 
describes the influence of gene i on the output label: 

!— 1 if over-expression of gene i favors label y = — 1 and its under-expression favors label y = 1, 
if gene i contains no (additional) information about label y, (1) 
1 if over-expression of gene i favors label y = 1 and its under-expression favors label y = — 1. 

We further on aim at finding variables Ji such that as few entries as possible are non-zero, forming thereby the desired 
sparse signature. Note that in this scheme Ji = does not imply that the output y^ is independent of the input if, 
but that gene i does not carry additional information about the output y 11 as compared to the genes in the signature 
(i.e., with non zero coupling). 

This classification scheme is clearly an oversimplification with respect to bio-medical reality, where a whole range of 
positive and negative interaction strengths is to be expected. On the other hand, given the peculiar restriction posed 
by the limited number of available experimental gene-expression patterns, having a simple (and hopefully meaningful) 
model reduces the risk of over-fitting and produces results which arc easier to interpret. 

An algorithm for binary classification based on Belief Propagation was already proposed by Kabashima [HI , where 
he considers continuous values for the couplings Ji, coupled with binary variables Cj establishing the presence of the 
gene i in the classification task (cj = 1) or its absence (c, = 0). The algorithm is tested only the Colon dataset and 
our results give a sensibly better generalization error. 

We further explore the possibility to improve our results allowing each Ji to take q > 3 discrete values. In all the 
performed simulations the generalization ability of the algorithm remains stationary or decreases when increasing q. 
We thus keep the simpler scheme considering the only possibilities Ji E —1,0,1. 
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A model of the functional relationship between input and output variables (the data set D) has to be formulated to 
proceed. Again we aim at keeping this model as simple as possible. We consider functions depending only on the sum 
of the input variables weighted by the coupling vector. This choice, given the Boolean nature of the output variables, 
results in a linear classifier [13], i.e. a perceptron, 

V" = sign (^2 + r ) ( 2 ) 

where the function sign (x) gives the sign of the input x, and sign(O) is set to +1. 

The full classification problem reduces thus to the inference of a coupling vector J = (J\, Jjv). 



A. Inference as a constraint satisfaction problem 

The coupling vector J and the threshold r are the free parameters of the problem. Following the usual strategy 
in statistical mechanics (l8| . we can define a cost function (or Hamiltonian) that, given the data set D, counts the 
number of patterns contradicting our threshold model as a function of the coupling vector J (and of the threshold r; 
this dependence will be taken as implicit from now on): 

w (j) = E 6 |-^(E j ^ + t )} . ( 3 ) 

with O being the Heaviside step function. This cost function does not include any dependence on the sparsity of the 
coupling vector J, so, to obtain a vector with as many zero entries as possible, we have to add an external pressure to 
our system. From the point of view of a model Hamiltonian, this can be obtained by an external field h (or chemical 
potential) being coupled to the number of non-zero entries in J, i.e. to 

N 

N ef f(J)=^2\Ji\ , (4) 
i=i 

The complete Hamiltonian for our system thus reads 

H(J) = H (J) + hN cS (J) ■ (5) 

Searching for the minimum of this cost function is the analogous of solving a zero temperature problem in statistical 
mechanics. 

This would be the correct procedure if the data were completely clean from noise and perfectly linearly separable. 
When this is not the case, imposing the correct classification for all the patterns could lead to an improper selection 
of the coupling vector J and consequently to a poor prediction ability. 

Keeping this in mind, we consider a "finite temperature problem" instead, where also solutions that leave unsatisfied 
some clauses are allowed. We see that in many cases these are the solutions that give the better predictions on new 
data. At this aim, we introduce a formal inverse temperature (3 and the related Gibbs measure 

P G ibbs( J) cc exp(-/?H - hN cS ) (6) 

with h = (3h. The values of j3 and h set the relative importance of satisfying the constraints given by the patterns 
versus the sparsity of the coupling vector. Large /? enforces satisfaction of the constraints, large h favors many zero- 
elements in J. At the moment these two parameters are free model parameters and we describe later on a strategy 
to fix them for specific data sets D. 

However, our approach is not seeking for one single configuration that minimizes the cost function; the objective 
is to characterize the statistical properties of all low-cost coupling vectors as weighted by fbibbs(</)- Once known 
these statistical properties, a subsequent problem is to characterize a proper classifier. We will discuss this point in 
the next subsection, where, to clarify the relation between our statistical mechanics and the Bayesian approach, we 
reformulate the problem using the point of view of the latter. 
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B. A Bayesian point of view 

The statistical-mechanics approach outlined so far can be reinterpreted in terms of the following Bayesian inference 
scheme |l9j . As a first step, we define a model that describes the likelihood P({j/ m }^=i,...,m | J) {£ m }>=i,...,m) 01 
a labeling {y AI } M= i,... l Af given a parameter vector J and the expression data {x m } m =i,...,m- As usually we assume 
different data points to be generated independently under the model, i.e. 

M 

P (W)>, i u I J> {^}m=i,...,m) = I] W I ^ ^) ' ( ? ) 

with the likelihood of a single label being defined coherently with the previous subsection, 

" N 



P(y» | J, jg") = — ^ exp | -/30 



Further on we define a prior on the space of parameters which favors again sparse vectors: n(J) oc 
exp{— hN e s( J)}(6( Ji] 0) + <5(Jj; — 1) + 8{Ji] — 1)) and where we used the Kronecker delta function to enforce that 

J 4 e{0,±i}. 

A straightforward application on Bayes theorem allows us to derive the posterior probability of J given the knowledge 
of the data-set D as: 

M 

P(J | D) cx P({y»] | J, {aP}) n(J) = [] W I X #*) n( J) (9) 

which is the analogous of Eq. ([6]). 

Let us now see how the knowledge of this posterior probability can be used in a given case of classification. Imagine 
that we have experimental access to a data-set D of M gene-array measurements D = {(x 1 , y 1 ), . . . , (x M ,y M )}. Now 
a new expression measure x° becomes available, but we do not know whether the experimental sample comes from 
a cancerous tissue or not, i.e. we do not know its annotation label y . We can, however, compute the conditional 
probability of y° given the knowledge of all experimental measurements in D and the new expression profile x°. 
Applying the sum rule: 

P(y°\D,aP) = J2P(y°J\ D > aP) = '£P(y \J,aP)P(J\D) , (10) 

we are ready to establish two probabilistic rule to assign the label y° to a sample x° on the basis of the posterior 
probability P(J \ D). To simplify the notation, let us define the quantity: 

N 

H = J2 Ji4 (11) 

i=i 

We can then classify the new pattern x° according to the Bayesian rule (|10|) . If we assume that plj is a sum of 
random independent variables, we get that the field H is approximately a Gaussian distributed random variable. In 
this way we can relate the probability distribution of H with the conditional probability of the output y°: 

<y°> = P (y° = 1| A 3°) - P (y° = -l\D t *>) = £ f ) P(J\D) 

{j} \ > 

r / -/30(-ff) -/36(H) \ 1-e" /" 

~ 7 ( l + e-P TTe~Pj G W,(^){H)dH = Y~~Ti J dH <V),<Aff*>(if) sign(ff) (12) 

where G rn a (H) is a Gaussian distribution with mean m and variance a. In our case these two parameters are given 
by: 

N 



(H) = ( 13 ) 

i=l 
N 

(AH 2 ) = (H 2 ) (H) 2 = £ {(A) - (Jf) {<? ■ (14) 



»=i 



■5 




FIG. 1: Factor graph representation of the problem. Circles are variables nodes, corresponding to single coupling variables 
Ji, rectangles represent factor nodes, the lower ones corresponding to data-given constraints, the upper (trivial) ones to the 
external diluting field. Links indicate functional dependence. 



The averages (•) in the previous equation are taken over the posterior probability distribution P(J | D). The choice 
of the label y° can be simply given by the maximum posterior probability criterion: 

y° = sign«y }) = sign«if)) . (15) 

It is worth pointing out that for computing both (y°) and (H) we do not need the knowledge of the whole posterior 
probability. The knowledge of the single-coupling marginal posterior probabilities Pi(Ji \ D) = ^{j,}^ P(J I D) is 
sufficient. In order to lighten notations, we indicate marginal posterior probabilities simply by Pj(J,), dropping the 
explicit dependence on the data set D. 

In the following section we introduce an efficient algorithm for estimating these quantities. 



III. THE ALGORITHM 



The message-passing strategy introduced in the following allows to efficiently estimate marginal probability distri- 
butions for single entries of the coupling vector. Given the Gibbs measure ([6]) one could in principle compute the 
marginal probability distribution of variable i using the standard definition: 

P^Ji) EE J2 P Gibb S (J) • (16) 

Unfortunately this strategy involves a sum over 3 N_1 terms and becomes computationally infeasible already for N 
larger then 30, as compared to thousands of gene-expression values measured in each micro-array experiment. In the 
next section we will explain how to overcome this difficulty by using the Bethe-Peierls approximation. 

Note that the marginal distributions Pi(Ji) contain valuable information for the extraction of a sparse signature. 
Genes i with a high probability of having a non-zero coupling even for large h have to be included in such a signature, 
they are likely to carry crucial information about the output label. Genes i with high probability Pj(Jj = 0) on the 
other hand can be excluded from most signatures, so their information content is either low or already provided by 
other genes. 



A. The message passing approach 



The belief propagation (BP) method, also known as Bethe-Peierls approximation, or cavity approach, in statistical 
physics, is exact on tree-shaped graphical models. As an approximate tool, it was therefore mostly used on sparse 
graphs [ill HH , where the influence of loops is expected to be not very strong. More recently, several applications 
to dense graphs have been successfully proposed [ljl [l8|, [22|, HH, HH HB|- Our problem can in fact be considered 
as a graphical problem over a fully connected bipartite factor graph, cf. Fig. [TJ with N vertices (variable nodes) 
representing the N variables Ji, and M vertices (factor nodes) representing the M constraints All pairs of these 
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two vertex types are connected, since each constraint depends on all variables. In addition there are local fields on 
variables corresponding to the diluting term in Hamiltonian ([5]). 

The BP algorithm provides a strategy for estimating marginal probability distributions. It works via iterative 
updates of messages, which are exchanged between variable and factor nodes. Let p, be one of the factor nodes and i 
one of the variable nodes. We can introduce the following messages, which travel according to the direction indicated 
by the indices: 



Pu—>i(Ji) describes a weight imposed by constraint \x on the value Ji of the coupling of variable i. 
• Pi— >n{,Ji) is the probability that variable i takes value Ji in the absence of the constraint set by factor node fx. 
The above-defined quantities satisfy the following set of self-consistent BP equations: 



N 




P l ^{J i ) = d^e-^ p„^i{Ji) (18) 



where Ci^ M is a normalization constant. A more detailed derivation of the equations in the case of the perceptron 
without dilution can be found in fl8l |. and in 10] in the case of dilution. 

In the algorithm both types of messages are initialized randomly, and the iteration proceeds via a random sequential 
update scheme. The algorithm stops when the convergence is reached, i.e. when the difference between each message 
at time t and the corresponding one at time £ — 1 is less than a predefined threshold (10 -12 in all our simulations). 
The number of needed iterations depends on the particular problem and on the values settled for the parameters /? 
and h. However, we do not exceed a hundred of iterations in the simulations reported above. 

Once all messages are evaluated, the desired marginal probability is given by the messages sent from all factor 
nodes and by the diluting field, 

Pi{Ji) = C z e- hW Y[p^ (Ji) (19) 

V 

where the Ci are normalization constants which can be easily determined by tracing the unnormalized expression over 
the three values Ji = 0, ±1. 

Note that in the definition of messages (JT7J) we still have a sum over 3^"* configurations, but the y e nter in the 
expression only through a linear combination, which enables us to use again a Gaussian approximation [18j |. The sum 
over couplings is replaced by a single Gaussian integral, which can be performed explicitly: 



/v , i ./, ) - / exp J J z ^ exp { _ 0Q + j.^))} ( 2o) 

' ' ' eTi \ y nit=l±Jfi ), (21) 



where Zu_—>i is a Gaussian variable with mean and variance, 



„2 



The averages (-)j->n are performed over messages Pj^^(Jj). In this way the complexity of Eq. (jTTJ) is reduced from 
0(3^) to 0(N) and that of the overall iteration to 0(MN) (The apparent complexity 0(MN 2 ) of updating MN 
messages in time 0(N) can be reduced to 0(MN 2 ) by a simple trick: The sums in Eqs. can be calculated 
over all j once for each /i, so only the contribution of i has to be removed in the update of p^i for each i. This 
allows to make the single update step in constant time) . A precise estimate of the overall complexity of the algorithm 
would require to control the scaling of the number of iterations needed for convergence. A theoretical analysis of BP 
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convergence times in a general setting (including the perceptron case) remains elusive. Some recent progress for the 
simpler matching problem can be found in [26[. 

The BP equations become feasible even for very large N and M, and can therefore be applied to biological high- 
throughput data sets. Note that, even if the central limit theorem is meant to be valid in the limit of N — ► oo, in 
practice it works very well already for N = 4 (where the exact computation is clearly feasible). 

The Bethe entropy can be computed from marginals and messages: 

M N 

Seethe = -5^5^P M (j)lnP M (J) + (M-l)5^ Pi(Ji)^Pi(Ji) 

M=l j *=1 Ji=— 1,0,1 

M M N M 

= -J^lnC-J^C**^ Yl P^i{Ji)Pi^M)^P i ^M) + PY; C ' lE » 

fi=~L m=1 i=l ■/«=-!, 0,1 IJ,=1 



N 



-(m-i)J2 E Pi(Ji)foPi(Ji 



i=l J 4 =-1,0,1 

In the first line, we have used the distribution 



(23) 



N 



N 



P„(J) = C exp -0Q -y» £ J a x* J] P i^M 



(24) 



j=i 



which describes the influence of a single factor node by conserved marginal distributions. In the second line, we use 
the corresponding single sample energy c^E^ defined as 



E„ 



E e ( -y" E Js < J exp { - /3e ( - yAl E ) } n p ^ ^ 



l_ erf ^ 



(25) 



In writing the last expression we have used again the Gaussian approximation as in Eqs. (|20ti22jl . 

Up to this point, the BP equations still depend on two undetermined parameters, namely the inverse temperature 
(3 coupled to the data-given constraints, and the diluting field h. To implement the algorithm we have to define a 
strategy for fixing these free parameters: 

• The diluting field h is the conjugate variable of the number of effective link N e g(J), so we can equivalently fix 
one of the two quantities. We decided to fix the number of effective links, and thus the size of the searched 
gene signature, and to choose h accordingly. To find the correct value of h we apply a cooling procedure, 
where after each interaction of the BP equations step we increase (resp. decrease) h depending on whether 
the effective number of link is higher (resp. lower) than the desired value. Since the true number of relevant 
genes is an unknown quantity, the chosen value for N e s(J) is, however, a free parameter. Practically, since the 
algorithm finds marginal probabilities and not a single configuration, we define the number of relevant links via 
its thermodynamic average: 



(iV eff H £[1-^(0)] 



(26) 



Comparing results for different values of (N e g) we see that the algorithm is fairly robust, as will be seen in the 
following sections. 

• The inverse temperature /3 is again fixed by a cooling procedure starting from a low value and increasing it 
until one of the following two conditions is met: (i) the energy reaches a small enough value (/3 — ► oo formally 
corresponds to zero energy), (ii) the entropy goes to zero (a signal for a freezing transition into a non-extensive 
number of coupling configurations at finite temperature) . In this last case we use the marginals computed at 
the zero-entropy temperature. 
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The diluting field h drives the system toward solutions of the desired average number of effective coupling (N e g). 
This is not yet enough for determining explicitly one signature of the desired size, since results are still of probabilistic 
nature. If we want to select, and use in our algorithm, only a desired number of genes, we have to couple the BP 
algorithm with the decimation procedure presented in [271 . |28| : 

1. Random initialization of messages. 

2. Convergence of the BP Eqs. (fT"7|) and (fT"8|) at the correctly self-tuned values of h and (3. 

3. Computation of marginal probability distributions using Eq. (|19p . 

4. Setting to zero the coupling variable Ji having the largest weight in zero (i.e., set Pi(Ji = 0) = 1 for this 
variable). 

5. GOTO 2. Repeat until N — N c s classification variables are set to zero. 

Practically, this procedure turned out to be computationally much too expensive, so we opted for a faster variant of 
Step 4 of the decimation procedure: after each convergence step of the BP equations, we rank genes according to 
fj(0), and we set to zero an extensive part of couplings at the top of this ranking. The same procedure is iterated 
until we reach the desired number of non zero weights Ji. 



B. A note on the centroids algorithm 

It is interesting to compare the results of the BP algorithm with simpler, but widely applied techniques. The 
centroids algorithm is based on the notion of distance (Euclidean distance in iV-dimensional space in this case) of a 
given pattern from the centers of mass of the two sets of patterns with annotations = 1 or — 1. The algorithm 
works in the following way: 

• Let B z = {n\y^ = z}, and M z = \B Z \ the number of patterns with label z. Here we consider z = ±1 only, but 
the algorithm works equivalently for multiple classes. We compute the centers of mass (or centroids) c z of the 
expression data with labels y = z: 

«, = £ £ * w 

z to a new sample x if its distance from c z is smaller than its distance from all other 
y° = argmin(P°- C ;||) , (28) 

Z 

where II. II can be the Euclidean distance in or, depending on the problem, any other meaningful notion of 
distance. 

In order to take into account that only a subset of the genes is relevant (sparse signature), we rank genes i = 1, ...,N 
according to the absolute value of their Pearson correlation C\ with the output: 




The s highest scoring genes are selected as the signature, and the distance of the new pattern x from the centroids 
is computed taking into account only these genes, i.e. the full problem is projected onto the s-dimensional subspace 
of all genes. Since the signature size s yielding the best classification is not known a priori, one has to consider the 
performance of the algorithm as a function of it. 



• We assign the label y° = 
centroids, 



IV. TEST ON ARTIFICIALLY GENERATED DATA 



Before running the algorithm on micro-array data, it is useful to test it on artificial data generated by a controlled 
input-output relation. In the simulations reported here the x% are drawn independently from a normal distribution. 

In order to compare the performance of the BP and the centroids algorithms, we label the random generated 
patterns according to two different rules, reflecting the perceptron and the centroids idea respectively. 
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FIG. 2: Generalization error as a function of the signature size s for artificial data generated according to Eqs. (|30I31[) with 
N = 600 and fci = ki — 0.025 (left and central panel) and according to the centroids rule with k — 6 (right panel). In the left 
and right panel we have Mtrammg = 600 (a = 1), in the central panel Mtrammg = 300 (a = 0.5). In all the cases M t0 st = 300. 
The curves are averages over 50 different realizations of the training and test sets. 
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FIG. 3: Precision-versus-recall curves for the teacher-student scenario. Results for the perceptron-like dataset in the case 
a = Mtraining/^V = I (left panel) and a = Mtraining/A = 0.5 (central panel), and for the centroids-like dataset for a — 
A/training/A^ = 1 (right panel). We display curves obtained from the BP algorithm with decimation (see text) for different 
values of the signature s. The curves derived from the Pearson correlation ranking used in the centroids algorithm are 
significantly lower than the BP ones in the perceptron-like dataset case. The same recall leads to lower precision in this case. 
In the centroids-like dataset case (displayed s — 600), the curves for the two algorithms are comparable but BP continues to 
perform slightly better. Performance goes slightly down for smaller s but stays comparable. In the left and right panel we have 
A/training = 600 (a = 1), in the central panel Mtrainmg = 300 (a = 0.5). In all the cases M tC st = 300. The curves are averages 
over 50 different realizations of the training and test sets. 



A. Description of the datasets. 

In the first series of simulations, we draw the couplings Jj* from the following distribution: 

Px(JP) = (1 - h - k 2 )5jo >0 + ^[djo tl + Sjo^} + ^[Sjoj + S J? ^ 2 } . (30) 

We consider the case k\ + k 2 <C 1 in order to have a sparse gene signature, the expectation value of N e g (defined as 
the number of non-zero entries of J°) equals (fci -I- k 2 )N . Labels are determined according to a rule similar to the 
one used for BP inference, 

^ = si S n (E W) ■ (31) 

If = only J? = 0, ±1 exist, values ±2 are excluded. In this case the data are feasible under model if Eq. [21 and 
data can be learned at zero energy (no wrongly assigned labels by the inferred vector J). The theoretical analysis 
presented in [To| shows a phase-transition line in the plane (a, h), with a — M/N. One phase, at low a and h, is 
paramagnetic. It perfectly memorizes the input-output relation given by the data, but there is no correspondence 
between the exponential number of possible coupling vectors and the data generating rule, the predictive properties 
for a new pattern x are poor. For higher h and/or a, the solution discontinuously jumps to a perfect retrieval of the 
input-output association vector J. A particularly important point is that at sufficiently high diluting field h perfect 
inference is possible at a-values much lower than the critical threshold a c for perfect inference at h = 0. 
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In the case k\ t 2 7^ the data are infeasible, since the structure of the data generator is richer than the one used for 
inference. Couplings of the data generator are allowed to take 5 values ({±2, ±1, 0}), and inference tries to fit data 
just using ternary couplings ({±1,0}). 

In the second series of simulations the patterns x M are labeled according to their similarity to two vectors, x + and 
chosen as representative of the two classes. In order to take into account the sparsity of the relevant genes, we 
consider the variables c, = 1 if i = 1, . . .k and c, = if % = k + 1, . . . N, with k <C N. We thus classify the patterns 
according to the restricted Euclidean distance: 

if = argmin ^ ayj « - x ijZ ) 2 j = argmin ^ ^ « - x^) 2 ^ z e {+, -} . (32) 



B. Results. 



We investigate the goodness of the two classifier, in the two different situations described above, according to two 
different measures. First, we consider the generalization error. In this case patterns are divided into disjoint training 
and test sets. Learning is done on the training set, and the inferred input-output rule is tested against the test set. 
The generalization error is defined as the fraction of misclassified patterns in the test set. Results are shown in Fig. [5] 
For the first type of dataset (perceptron-like) one can appreciate how BP outperforms the simple centroids algorithm. 
In the second type of dataset (centroids- like) , the centroids algorithm outperforms BP in the high signature cases but 
the two algorithms are comparable for low signatures. 

As a second test on the accuracy of BP, we use the fact that the data-generating signature is known. It can be 
compared directly to the inferred one, allowing to group genes into four classes: True positives (TP) are those genes 
which are contained in both signatures, i.e. genes which are correctly identified as being relevant by the inference 
algorithms. False positives (FP) are in the inferred signature, but they are not in the true one. In analogy we define 
true negatives (TN) as those which are in neither signature, and false negatives (FN) are those genes which are in the 
data generator, but they are not recognized by the algorithm. The recall RC (or sensitivity) and the precision PR (or 
specificity) are thus defined as: 

RC = , PR = ^IL . (33 ) 

N TP + N FN N T p + N FP y ' 

The recall RC thus measure the fraction of correctly inferred genes in the signature, while precision PR is the fraction 
of inferred couplings which are actually true ones. These two quantities are obviously competing: An algorithm that 
includes all genes into the signature has very low precision (PR = but maximum recall (RC = 1), while 

including only the genes with very strong signal into the signature may result in a good precision, but at the cost of a 
potentially poor recall. A perfect algorithm would have recall and precision both equal to one, so the interplay between 
both quantities is the relevant " observable" . A curve precision- versus-recall can be constructed by ranking all genes 
according to their probability of having a non zero weight, and introducing different cutoffs in the ranking: In Fig. [3] 
we show the numerical results for the two datasets, while a more detailed theoretical analysis for the perceptron-like 
dataset can be found in [10(. We see that for the data generated by the perceptron like rule, BP actually performs 
considerably better than ranking by correlations, whereas results are comparably good in the case of the centroid-like 
generator. 



V. TEST ON TUMOR DATA 



As stated in the introduction, a common problem of micro-array experiments is the small number of samples in the 
data set (from tens to hundreds) compared with the number (thousands) of monitored genes. Values of the parameter 
a — M/N in the considered data sets range from a — 0.01 to a — 0.01. In principle these values of a are below the 
threshold at which BP outperforms simple pairwise correlation based methods, at least on data artificially generated as 
described in the previous section and more extensively discussed in [101 ] . On the other hand, the simple ratio between 
the number of patterns and the number of genes might not be the relevant parameter on real data sets, due to the 
non trivial correlations between different patterns, and between genes. We first consider three data sets (leukemia, 
colon and prostate cancer), already analyzed in [l2| as benchmarks for other algorithmic strategies (e.g. BagBoost, 
Random Forest, Support Vector Machine, k-nearest neighbors, Diagonal Linear Discriminant Analysis). We further 
study a newer and larger data set on breast cancer from two different laboratories [H, 03] ■ 
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N 


M 


Class A/B 


-/^/training 


Mtest 


NDP 


Leukemia 


3571 


72 


25/47 


48 


24 


200 


Prostate 


6033 


102 


52/50 


68 


34 


100 


Colon 


2000 


62 


40/22 


42 


20 


200 


Breast 


6401 


174 


86/88 


20-160 


154-14 


100 



TABLE I: In this table we display the number of probes N, the number of patterns M, the class composition of the data set 
Class A/B, the size of the training set Mtrammg, the size of the test set Mtest, and the number of different partitions NDP (see 
text) for each of the data set analyzed. 



A. Description of the data sets 

We have analyzed four different data sets of cancer tissues: 

• Leukemia: It consists of 72 samples of two subtypes of leukemia - 25 samples of acute myeloid leukemia (Class 
A) and 47 samples of acute lymphoblastic leukemia (Class B) - measured over 3571 genetic probes (29| . 

• Prostate It consists of 102 samples - 52 from prostate tumor tissues (Class A) and 50 from normal prostate 
tissues (Class B) - measured over 6033 genetic probes [3Cj . 

• Colon: It consists of 62 samples - 40 from colon adenocarcinoma tissues (Class A) and 22 from normal colon 
tissues (Class B) - measured over 2000 genetic probes [3lj |. 

• Breast: The data set we have analyzed is the union of two different experiments presented in [l3l . [LH ]: it 
contains 311 samples measured over 24496 probes. We labeled the samples according to the following criterion: 
a metastasis event occurred in the first 5 years after the appearance of the tumor (Class P [poor prognosis]), 
the remaining samples did not develop a metastasis in this time window and were labeled as Class G (good 
prognosis). In order to reduce the noise we removed all probes with nearly constant expression values across 
the data set. More specifically a probe is included in the data set if at least one of the of the following two 
conditions is met: (i) its variance is larger than 0.1, (ii) in at least ten samples its expression value is outside the 
window (-0.3,0.3). Eventually we ended up with 6401 probes. We further notice that the number of elements 
in Class P (86) is much smaller than those in Class G (225). Since a major aim in this context is to correctly 
classify all members of Class P (those developing metastasis) and not to misclassify them as Class G cases, a 
larger influence of class P data is needed. To obtain a more balanced dataset, we therefore randomly removed 
137 elements from Class G and we end up with a set of 174 patients. 

In Tab. U we resume the details of the data sets. 



B. Results 




— — — 

10 100 1000 



FIG. 4: Generalization error as a function of jV e ff in the colon, prostate, and leukemia data set. 
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FIG. 5: Generalization error computed with Centroids (blue line) and Belief Propagation (red line), as a function of the 
signature size s, shown with a logarithmic scale on the x axes. We display results for leukemia (left panel), prostate (central 
panel), and colon (right panel). 



For each data set we construct different realizations of the training and test sets by randomly permuting the M 
samples. The first M^aming patterns will be the training set and the remaining Mt es t = M — M tra i n ing patterns will 
be used as test set. In this way we are able to obtain results, which are not dependent on a single specific arbitrary 
partitioning of the patterns into training and test sets, and to attribute statistical errors to measured observables. In 
the last three columns of Tab. [J we give the actual values of M tra ining, Attest , and the number of different partitions 
for each of the data sets. 

In all data sets discussed here, we have run BP starting at finite temperature and using the cooling scheme described 
in Sec. lIII M which reached zero temperature. Running the algorithm at various fixed finite temperatures, we observed 
the generalization error to decrease monotonously with temperature, and to saturate at low temperature. Here we 
present directly the zero-temperature results obtained using cooling. 

For the first three data sets (Leukemia, Prostate, and Colon) we first run BP on the entire probe-set with the 
diluting field h defined in Eq. ©. We show in Fig. 2] the generalization error for the three data set as a function of 
N e ff, which, as already explained in Sec. IIII Al can be fixed by tuning the field h. 

In Fig. [4] two different scenarios emerge: the colon and prostate curves display a minimum generalization error at 
a relatively low value of N e g ~ 10, while the leukemia curve is monotonously decreasing. This seems to indicate the 
presence of a small set of probes relevant for the classification in the prostate and colon case, while in the leukemia 
data set it seems that all probes are relevant for the classification. 

Let us recall that a given N e g does not necessarily indicate the actual size of the set of relevant probes. This would 
be the case only when all probabilities P%{J) were completely polarized (i.e. Pj(0) is either or 1). Upon direct 
inspection of our BP results it turns out that this is not the case. No clear threshold is shown on the marginals Pj(0). 
The dilution thus seems to be an effective strategy to attribute differential weights to the probes, but it is not clear 
how to use it to select a relevant signature in the data set. A possible way is to implement a decimation procedure 
as explained in Sec. IIII Al in order to test the performance of the algorithm on a restricted and selected set of probes 
(signatures) of different size. 

Domany et al. in [tJ studied the stability of different signatures as a function of the number of samples used for 
learning. They interestingly noted that the small overlap between different signatures is not only due to the different 
classification strategies used by different groups, but in particular resulting from the small number of available cases. 
Such a lack of robustness emerges even when using the same algorithm on data generated with the same probabilistic 
framework [7J. We have investigated how stable are the lists obtained with our decimation procedure. It turns out 
that an analogous phenomenon of instability occurs in our lists. We observed very few genes appearing in all lists: 
the actual numbers differs among the different data sets but it ranges from 10% in the case of Leukemia, to 0.02% in 
the case of prostate tissues, where however the number of relevant genes seems to be very small as we will see in the 
following. 

To understand how BP compares with simpler algorithmic strategies, we used the centroids method selecting 
signatures of the same size. The results for the first three data set are displayed in Fig. [5] The curve for Leukemia 
displays a monotonously decreasing profile in agreement with what we have obtained fixing N e g shown in Fig. SI This 
seems to indicate that no signature can be defined in this case. The generalization error obtained with the centroids 
algorithm is slightly worse then that of BP. The Prostate case displays dramatic difference in the two algorithms: 
from the BP point of view the curve displays a plateau of minimal generalization error for signature larger than 
100, at odd with what obtained fixing N e s where a minimum is present around N e g = 10. Also in this case it 
seems difficult to determine a clear signature in the data set. Centroids behaves in the opposite way displaying a 
monotonously increasing function with minimal generalization error at values of the signature lower than 10. The 
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BP 


centroids 


BagBoost 


RanFor 


SVM 


kNN 


DLDA 


Leukemia 


0.025(2) 


0.029(2) 


0.0408 


0.025 


0.035 


0.0383 


0.0292 


Prostate 


0.070(4) 


0.071(4) 


0.0753 


0.0788 


0.0682 


0.1059 


0.1418 


Colon 


0.140(6) 


0.156(4) 


0.1610 


0.1543 


0.1667 


0.1638 


0.1286 



TABLE II: In this table we compare t he g eneralization errors of our method (first column) against centroids, BagBoost, RanFor, 
SVM, kNN, and DLDA presented in 

optimal generalization error achieved by the two algorithm in this case is compatible. The Colon case is analogous 
to that of Prostate from the BP point of view (the minimum generalization error is obtained for iV c g = 10 but for 
s = 400), while centroids seems to be rather insensitive to signature size in the whole interval analyzed. In this case 
BP shows an overall better generalization ability. The discrepancy between the values of the signature s and of the 
N e g for which the generalization error is minimum, comes from the not unique set of possible relevant genes. While 
the algorithm is able to select genes which are relevant and weight them adequately (N e g small), it is not able to 
define a single clear signature which alone is sufficient for the classification. This implies that, in the decimation 
procedure, a much higher number of genes, even if with small weight, is necessary to achieve a good predictive ability 
(so that the minimum generalization error is reached for signatures s much bigger than the probabilistic expectation 
given by N c ff). It is worth pointing out that the fact that BP behaves worse for very small sizes, is already present 
in the case artificial data Fig. [5] 

The best generalization errors of both BP and centroids are displayed in Tab. ITT1 compared with the results presented 
by Dettling in [l2j where no statistical error was associated with measures. We see that BP outperforms 4 out of 6 
other algorithms on all three data sets, the other two algorithms perform better on a single data set each. 

In the breast-cancer case, we consider first the generalization error for various sizes of the training set (Mt ra i n ing = 
20, 160). The relatively large balanced data set containing 174 patients allows for the direct analysis of sample 
sizes ranging over almost a full order of magnitude. The results are presented in the left panel of Fig. [6l each point is 
averaged over 50 random selections of the training set. We observe a strong monotonous decrease of the generalization 
error. This is a very encouraging sign and should motivate for collecting larger data sets. The generalization error of 
30% obtained for M tra i n ing = 80 is compatible with the one (31%) reported in (32j. 

However, in the breast cancer set we can go into more detail: The generalization error treats good and poor cases 
(Classes G/P) in the same way. Indeed, one of the open challenges in breast cancer treatment is to recognize the 
correct cancer sub-type at the earliest possible stage of disease, indicating, e.g., the possible sensitivity of the cancer 
to chemotherapy. In the present case, it is strongly preferable to erroneously include a Class G patient into Class P 
than vice versa; it has to be avoided to predict the absence of metastasis for a patient who actually develops one, and 
possibly not to provide the necessary medical treatment to such a patient. 

We therefore introduce four different subclasses: 

1. patients in Class G (no metastasis) correctly classified as Class G, their number is called Mqq; 

2. patients in Class P (metastasis) misclassified as Class G, their number is called Mpg; 

3. patients in Class G (no metastasis) misclassified as Class P, their number is called Mqp; 

4. patients in Class P (metastasis) correctly classified as Class P, their number is called A/pp. 

As said before, our primary aim is to classify correctly Class P patients. The fraction Mpc/(Mpp + Mpq) of 
misclassified Class P patients has to be kept as small as possible; we refer to it as the poor error fraction (PEF). Once 
this is kept small, we also would like to recognize correctly as many Class G patients as possible. The corresponding 
fraction in the total Class G sample is Mqq / (Mqq + Mqp), we refer to it as the good prediction fraction (GPF). 

The GPF vs. PEF curve for a perfect classifier would be constantly equal to zero for the full GPF range, while 
a random classifier would produce a curve PEF=GPF. Here we want to characterize the relation of these two com- 
peting quantities (a constant prediction as P would have PEF=GPF=0, whereas a constant prediction G would have 
PEF=GPF=1), keeping however in mind that we want to keep the PEF low. The full curve is obtained done by 
changing the threshold r defined in Eq. @ of the inferred classifier. 

The result is displayed in the right panel of Fig. O for training sets of size M tra inmg = 100, 160. For each possible 
PEF value we have averaged the curve over 50 random balanced partitionings of our data into training / test sets. 
When training with 100 samples, we have 37 points corresponding to the possible numbers of misclassified Class P 
patients (0 to 36). When training with 160 samples, this number reduced to 8 points. Note that the finding that 
the curve for larger training sets is located right of the other curve is consistent with the observation done in the left 
panel of Fig. [5] Larger training sets lead to better results. It is particularly striking that the PEF starts to grow 
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FIG. 6: Results for the breast-cancer dataset. Generalization error as a function of the training set size (left) and PEF (fraction 
of Class P patients classified as G) vs. GPF (fraction of correctly classified Class G patients). 

much later, the GPF at zero PEF grows from about 10% to about 40%. Again we see that larger data sets should be 
collected for developing more precise prognostic tools. 

In order to compare our result with that of fl3j , we measured the value of the GPF corresponding to an average 
PEF value around 0.07. This can be obtained by averaging over cases with Mpq being at most one. In this case 
we obtained a PEF of 0.076(1) and a GPF of 0.50(2). These values are comparable to [lj] where a PEF=0.071 and 
GPF=0.53 are reported for a set of 180 patients and a leave-one-out procedure (M tra ining = M — 1). 

In conclusion, for the breast cancer dataset, we obtain results comparable with state of the art studies, and we find 
a not saturated increase of performance with growing training set size. 

VI. CONCLUSION AND PERSPECTIVES 

In this work we introduced and analyzed a message passing algorithm for the problem of classification and applied 
it to genome-wide expression patterns of cancer tissues. The aim of the algorithm is twofold: on one hand one would 
like to get a reliable classifier, on the other hand one wants to base this result on a minimal set of discriminating genes. 
This additional requirement of parsimony is based on the biological intuition that, as long as the expression level is 
concerned, only a relatively small subset of genes is actually significant for the development of the disease. Furthermore 
the possibility of identifying a meaningful (possibly small) signature of the disease will help the classification in two 
ways: (i) computationally, the dimensional reduction of the patterns makes the classification less prone to over-fitting, 
(ii) experimentally, specific targeted gene-chip scanning could improve early stage cancer diagnostic (e.g. in the case 
of breast-cancer). 

The performance of our algorithm is found to be slightly better, although comparable with the one of other state-of- 
the-art classification techniques. Using three different data sets, the BP performed better than 4 out of 6 algorithms 
on all data sets, and better than the other 2 algorithms on two out of the three data sets. The obtained results 
are compatible with the following theoretical intuition: The sparse quality (experimental and biological noise) and 
quantity (few expression patterns) of available data does not allow to extract substantially more information than the 
one seen also by simple algorithms. From this point of view the situation will improve in the near future since gene- 
chip technologies are becoming more and more diffused, and the emergence of new array technologies based on longer 
oligo-nucleotides should reduce the experimental noise of the expression measurements. This ongoing technological 
revolution calls for the development of sophisticated global classification techniques which are able to unravel, e.g., 
combinatorial control. 

Another line of development in tumor detection is the integration of different sources of information. From this 
point of view SNPs (Single Nucleotide Polymorphism) appear to be the most promising high-throughput genome- wide 
technology. 
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